ABSTRACT


In this paper we present the Reproducible Research Publication Workflow (RRPW) as an example of how 
generic canonical workflows can be applied to a specific context. The RRPW includes essential steps between 
submission and final publication of the manuscript and the research artefacts (i.e., data, code, etc.) that 
underlie the scholarly claims in the manuscript. A key aspect of the RRPW is the inclusion of artefact review 
and metadata creation as part of the publication workflow. The paper discusses a formalized technical 
structure around a set of canonical steps which helps codify and standardize the process for researchers, 
curators, and publishers. The proposed application of canonical workflows can help achieve the goals of 
improved transparency and reproducibility, increase FAIR compliance of all research artefacts at all steps, 
and facilitate better exchange of annotated and machine-readable metadata. 


* Corresponding author: Limor Peer (Email: limor.peer@yale.edu; ORCID: 0000-0002-3234-1593). 
1. INTRODUCTION


In recent years, there has been a growing expectation by research funders and journals that researchers 
share their data and code in efforts to encourage computational reproducibility and replicability [1, 2]①.
Open Science initiatives [3, 4] also contribute to a growing expectation that the research artefacts (i.e., 
data, code, etc.) be made available alongside the manuscript for review [5, 6]. A number of journals are 
putting in place new policies that incorporate verification of the computational reproducibility of results 
reported in the manuscript prior to and as a condition of publication. Examples in the social sciences include
the American Journal of Political Science② and the American Economic Review③. These efforts are driven by
the imperative to improve the transparency and documentation of the processes that led to research results 
based on data, to verify and enable future reproducibility, to facilitate replication studies, and to increase 
the level of Findability, Accessibility, Interoperability, and Reusability, or “FAIRness” [7], at all steps. 
Additional goals are to foster (transdisciplinary) reuse and to gain new knowledge from existing data. 


Promoting these goals requires redefining the evidence base that substantiates published data-based
scholarly claims. Whereas open data sharing practices are becoming more widespread, computational 
reproducibility sets even higher expectations for research transparency because it requires code sharing 
and execution. More specifically, it requires a set of digital files that supports the scholarly manuscript and 
includes the data, the code (software code of the conducted procedures, program files of the data analysis), 
and associated documents such as a codebook, README, and other supporting documentation and 
metadata. This set of files can be described as a research compendium [8] and it represents a more complete 
scholarly record [9]. 


Publishing reproducible research requires an update to the manuscript publication workflow. Key 
components of the traditional scholarly publication workflow—manuscript submission, peer review, 
editorial decision, and publication—are well established. But the review of associated files, documentation, 
data, and code as part of this workflow is a relatively new development [10], albeit one that we predict 
will become commonplace. 


In this paper, we describe an updated manuscript review and publication workflow, the Reproducible 
Research Publication Workflow (RRPW). The workflow introduces procedures for quality review of the 
research artefacts necessary to reproduce the article’s findings (i.e., data, code, etc.), while also considering 
these artefacts as bound together into a unitary object for dissemination, interpretation, and reuse. We apply 
the Canonical Workflow Framework for Research (CWFR) approach to formalize the compulsory steps 
involved in the publication of reproducible research and identify recurring steps associated with artefact 
review. The CWFR approach requires codification of canonical steps and adherence to technical standards 
that support and enforce FAIR principles [11]. 


① Following NASEM (2019), we define reproducibility as “obtaining consistent results using the same input data; computational
steps, methods, and code; and conditions of analysis. This definition is synonymous with ‘computational reproducibility’. . .”
We define replicability as “obtaining consistent results across studies aimed at answering the same scientific question, each
of which has obtained its own data.”

② AJPS verification policy at https://ajps.org/ajps-verification-policy/ (accessed 20 January 2022)

③ AER data and code policy at https://www.aeaweb.org/journals/data/data-code-policy (accessed 20 January 2022)
We also harness the technological affordances of the FAIR Digital Object (FDO) structure [12] to establish
the compendium itself as a robust digital object. The objective is to facilitate seamless reuse of high-quality
research output in the future by (a) improving transparency and reproducibility of the research process,
(b) increasing FAIR compliance of all research artefacts at all steps, and (c) enabling the exchange of annotated
and machine-readable metadata produced by component parts and participating platforms. One overarching
aim of the CWFR and RRPW is to highlight key automated steps throughout the research lifecycle so as to
reduce the burden for all stakeholders along the way to producing published reproducible research.


2. EFFORTS TO ENSURE THE REPRODUCIBILITY OF PUBLISHED RESEARCH RESULTS 


We argue that the publication workflow should include a quality review of the full research compendium 
for the purpose of verifying computational reproducibility and enhancements to metadata in order to follow 
the FAIR principles. The workflow must be informed by researchers, data curators and archivists, publishers, 
and infrastructure experts. 


A manuscript publication and compendium review process that has the stated goal of computationally 
reproducing the claims reported in the associated manuscript requires new methods and processes. Christian 
et al. [10] describe a process which consists of six essential, or canonical, steps (Figure 1): A manuscript 
submitted for peer review may lead to a conditional acceptance (1, 2) which will trigger a request to submit 
the compendium containing data, code, software, and documentation (3). The materials will undergo a 
curation and reproducibility verification process which, if successful (4), may lead to the final acceptance 
of the article (5). This then leads to the publication of the manuscript in appropriate journals and the 
publication of data and software in designated data repositories (6). This process is used by several social 
science journals that publish quantitative research using statistical analysis methods. It is important to note 
that the generic nature of this process allows for parallel or inverse steps, e.g., some high-ranking journals 
may require data and code submission before or at the same time as manuscript submission. 


[Figure 1: Manuscript submission → Conditional acceptance → Compendium submission → Curation + verification → Final acceptance]


Figure 1. Integrated manuscript publication and compendium review workflow. 
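

The six steps described above can be read as a small state machine. The following Python sketch is an illustration only, not part of the process described in [10]: the names Stage and next_stage and the passed flag are hypothetical, and the choice to send a failed verification back to compendium submission is an assumption made for the example.

```python
from enum import Enum


class Stage(Enum):
    """The six canonical steps, numbered as in the text."""
    MANUSCRIPT_SUBMISSION = 1
    CONDITIONAL_ACCEPTANCE = 2
    COMPENDIUM_SUBMISSION = 3
    CURATION_AND_VERIFICATION = 4
    FINAL_ACCEPTANCE = 5
    PUBLICATION = 6


def next_stage(stage: Stage, passed: bool = True) -> Stage:
    """Return the stage that follows `stage`.

    `passed` is a hypothetical flag for the outcome of the current stage.
    Assumption: an unsuccessful verification returns the compendium to
    the submission step; every other stage simply advances by one.
    """
    if stage is Stage.PUBLICATION:
        raise ValueError("workflow already complete")
    if stage is Stage.CURATION_AND_VERIFICATION and not passed:
        return Stage.COMPENDIUM_SUBMISSION
    return Stage(stage.value + 1)
```

Journals that require data and code at the time of manuscript submission would reorder or merge the first three stages; the sketch does not model that variant.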


Peer et al. [13] describe in detail the activities performed as part of the curation and verification process. 
The goal of these activities is to ensure that digital artefacts supporting a scholarly claim meet quality 
standards for reproducibility and for sharing and archival preservation. The review activities are highly 
asynchronous and require interactions between humans (e.g., curators and authors) which should be 
supported by software where possible to increase efficiency. In particular, the data review and code review 
steps may present significant challenges [14]. Code review and reproducibility verification, for example, 
could lead to the identification of errors which would then lead to new calculations and a new paper
submission, increasing the length of the overall review time. 


One mechanism that can help implement steps in this typical workflow is the Yale Application for
Research Data (YARD). YARD manages and tracks metadata production and other quality review activities
that contribute to “FAIRer” research, and it exposes the review process to scrutiny [15, 16]. Figure 2
illustrates the actions of both machine and human actors (i.e., curators and authors) and the interactions
between these two types of actors. Note that some curators’ activities are done “offline”, i.e., the
curator uses manual or (semi-)automatic procedures that return “quality indicators”, including formal
specifications that allow machines to act. YARD and similar tools can ensure that the compendium as a whole
and its component parts hold the necessary properties of a FAIR Digital Object that allow distributed actors to
execute canonical steps.
Figure 2. YARD curation and review steps.


3. THE REPRODUCIBLE RESEARCH PUBLICATION WORKFLOW 


The Reproducible Research Publication Workflow (RRPW) describes how the research compendium is 
processed during the manuscript submission, review, and publication workflow. The RRPW embeds quality
review of the research compendium into the manuscript review and publication process, building on the
process described in Christian et al. [10] and Peer et al. [13].
It is important to note that the RRPW is adjacent to the preceding data generating and data analytic
workflows, the parallel data management workflow (with substantial intersections), and the subsequent 
(transdisciplinary) data reuse workflows. While the RRPW aims to achieve the publication of reproducible 
research, it is not solely the purview of publishers. Indeed, the RRPW relies on authors and allied 
professionals, such as data curators and stewards, to, for example, contribute metadata from the beginning 
of the research lifecycle, in order to achieve metadata-rich published materials. In comparison to the other 
workflows, the RRPW has special features. Unlike the typical data-generating workflow, the RRPW involves 
additional actors aside from the researcher who performed the data collection and analysis presented in 
the article. These external actors may include individuals from the researcher's own institution, the journal, 
or the data archive, who each execute assigned RRPW tasks. This is largely dependent on funder requirements, 
the data management plan (DMP), the quality assurance procedures, and the curation and archiving strategy. 


3.1 RRPW and FDOs 


Ideally, the research compendium not only contains the requisite files but also encapsulates all other
elements that would qualify it as a FAIR Digital Object (FDO). FDOs are machine-actionable units that
bundle the data with all components necessary for identifying, rendering, interpreting, and accessing the
data [12]. They are represented by a bitstream, referenced and identified by a persistent ID, and have properties
described by metadata. The FDO model for research specifies the properties of these data packages that
satisfy both FAIR principles and the needs and expectations of the scholarly community. As an application 
of the CWFR framework, the RRPW is designed to output research compendia as FDOs [17, 18] by 
establishing the necessary activities to ensure that the research compendium and its component parts are 
identified by universally unique, persistent and resolvable identifiers, that they are associated with 
comprehensive metadata, and that they are stewarded by a trusted repository for long periods of time. 
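

To make the three FDO ingredients named above (bitstream, persistent ID, metadata) concrete, the following Python sketch models a digital object and lists which ingredients are missing. It is a minimal illustration under stated assumptions: the class DigitalObject, the function missing_fdo_properties, and the REQUIRED_METADATA keys are hypothetical and do not come from the FDO specification [12].

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical minimum set of metadata keys, chosen for this example only.
REQUIRED_METADATA = ("title", "creator", "repository")


@dataclass(frozen=True)
class DigitalObject:
    pid: str                                   # placeholder for a Handle or DOI
    bitstream: bytes                           # the content itself
    metadata: dict = field(default_factory=dict)

    @property
    def checksum(self) -> str:
        """SHA-256 digest of the bitstream, usable as a fixity value."""
        return hashlib.sha256(self.bitstream).hexdigest()


def missing_fdo_properties(obj: DigitalObject) -> list[str]:
    """List the ingredients this sketch expects but does not find."""
    problems = []
    if not obj.pid:
        problems.append("pid")
    if not obj.bitstream:
        problems.append("bitstream")
    problems.extend(
        f"metadata.{key}" for key in REQUIRED_METADATA if not obj.metadata.get(key)
    )
    return problems
```

An empty list means only that the object passes this illustrative check; stewardship by a trusted repository and resolvability of the identifier cannot be verified from the object alone.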


The RRPW establishes standardized processes for subsequent data sharing and reuse workflows, thus 
increasing the efficiency and thereby reducing the cost of labor-intensive data curation and verification 
procedures. In doing so, the RRPW can help streamline these tasks (which are often decentralized, with 
workflow steps distributed across the various actors working in their own institutions), and the interactions 
and/or dependencies among them. Moreover, it generates a portable FDO, which enables external actors 
to engage with the canonical workflow as required, while also ensuring that the research community has 
access to the FDO. 


The RRPW also helps anticipate the subsequent data reuse workflow and address any special processing
needs for data intended for transdisciplinary reuse (e.g., digitization of cultural heritage in archaeology).
Here, additional workflow steps are necessary to capture comprehensive contextual information in a
Project-Metadata-Digital Object, or MD-DO, after data generation. The metadata schema must also make
the digital objects findable and accessible for researchers from all target disciplines. If transdisciplinary 
reuse is intended, these criteria must be incorporated into the quality assessment and reflected in the 
appraisal system and batches. Above all, the involvement of other disciplines raises the question of 
overarching standards and rules. 
To the extent that journals are using repositories committed to long-term preservation and archiving in
accordance with industry standards, e.g., ISO 14721: The Reference Model for an Open Archival Information 
System [19], they are already making use of standard actions that can have different implementations. For 
example, journals may have different criteria around file formats, metadata, the use of licensed statistical 
software, repositories for storing the artefacts, methods of linking to these repositories, and different workflow 
management protocols, including whether to involve third parties. A CWFR framework—which unifies 
standard actions and workflow technologies, as well as facilitates FAIR principles—is useful for the 
manuscript publication and artefact review workflow. 


3.2 RRPW Canonical Components 


We use the same diagram style used in other cases included in this journal issue to introduce the 
transition to FDOs (Figure 3). We indicate human responses by pale blue boxes and use the term “package” 
(instead of “file” and “config file”) which is defined by the MD-DO and refers to all relevant documents 
including those that are added by the curators. To simplify, we do not indicate all possible feedback loops 
where, for example, a data curation step leads to a request for the authors to improve the metadata. 
It should be noted, however, that the researcher does not have to start from scratch in certain cases but 
could refer to the appropriate MD-DO object and adapt it or its components. Some of these updates are 
so simple that they can be carried out in the “perform updates” action (see below). 


[Figure 3: start CWFR project → request paper review → ingest package → request data & code review → perform curation → publish all → end CWFR project (with package storage)]

Figure 3. Reproducible Research Publication Workflow (RRPW). 
The process comprises canonical steps and associated actions, potentially performed by an editorial
system. The MD-DOs, which capture the state at each step and replace the “configuration” file in the RRPW,
are first-class citizens on the Internet, are themselves FAIR, and are not hidden in some tool-driven database.
Actions can be amended with specific library packages that deal with special requirements dependent on
institution and community. Of course, a simple and temporary first modification could be to maintain
the configuration file and to create an MD-DO in addition. Crucial for “FAIRness” is that all references
made in the MD-DOs are based on well-supported Handles④. This suggestion extends to the
reviewer-author interaction, which can also be collected in a cumulatively growing FDO that is
referred to by the MD-DO.


The following are the various steps in this canonical workflow, indicated in Figure 3 by the seven blue 
arrows, and associated actions, represented below in bullet points. 


Step 1: Start CWFR project. A researcher initiates a reproducible research publication project by 
submitting a paper to an editorial system resulting in a first MD-DO after having prepared all needed 
information (i.e., the research compendium), including metadata (assuming journal policy and guidelines 
supporting FAIR are established). 


• Researchers submit paper to an editorial system;
• System assigns an ID number to the manuscript.


Step 2: Request paper review. The paper component of the package is sent to reviewers and their
comments are collected. After some iterations, the review process may result in approval of the paper. The
reviewers’ comments are aggregated in a Review FDO, which is also referred to via the MD-DO.


• Editorial system initiates peer review process for manuscript (editor selects reviewers, including
reviewer comments).


Step 3: Ingest package. The user uploads all information via an ingest front-end of a curation tool or 
curation-enabled platform into a temporary store (workspace), where the package is established as a record 
(i.e., an abstract representation of the package, its components, and their relationship). The uploaded 
package is given initial archival processing treatment to enable review of the package as an FDO, and the 
MD-DO is enhanced to include the reference to the package FDO. In addition, the usual information, such as
a time stamp, is added (note: this step might be sequential to or part of Step 1):


• Establish the record in the system;
• Assign an ID number to the package as a unique record (persistent identifier, or PID, including the
version);
• Create metadata for record;
• Assign an ID number to the files as unique objects (PID) within the record (versions);
• Establish relationship among the files (e.g., via metadata);
• Create metadata for each file/component (e.g., file type), including checksum;
• Link all relevant files to manuscript to define the record (establish a compendium via packaging
system/software or container); and
• Identify unstructured documentation (e.g., README, codebook, data dictionary).


④ On the relationship between Handles and Digital Object Identifiers, see https://www.doi.org/factsheets/DOIHandle.html.
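

The ingest actions listed for Step 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the behaviour of YARD or of any editorial system: the function ingest_package and all field names are hypothetical, UUID strings stand in for persistent identifiers (a real system would mint Handles or DOIs through a registration service), and documentation is detected by a simple file-name prefix.

```python
import hashlib
import uuid


def ingest_package(manuscript_id: str, files: dict[str, bytes]) -> dict:
    """Build a record for an uploaded package (file name -> content)."""
    record_pid = f"record-{uuid.uuid4()}"       # stand-in for a real PID
    components = []
    for name, content in sorted(files.items()):
        components.append({
            "pid": f"file-{uuid.uuid4()}",      # one identifier per file
            "name": name,
            "version": 1,
            "file_type": name.rsplit(".", 1)[-1].lower() if "." in name else "unknown",
            "sha256": hashlib.sha256(content).hexdigest(),   # checksum
            "is_part_of": record_pid,           # relationship to the record
            # Assumption: documentation is recognised by its file name.
            "is_documentation": name.lower().startswith(("readme", "codebook")),
        })
    return {
        "pid": record_pid,
        "version": 1,
        "manuscript_id": manuscript_id,         # link back to the manuscript
        "components": components,
    }
```

Calling ingest_package("MS-001", {"README": b"...", "analysis.R": b"...", "data.csv": b"..."}) would return one record whose three components each carry their own identifier, checksum, and a pointer to the record.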


Step 4: Request data and code review. An action to initiate a data and code review is taken. 


• Editorial system initiates quality review process for research compendium (editor selects reviewers,
who could be peers or specialists, including reviewer comments).


Step 5: Perform curation. Action is taken to verify the computational reproducibility of data and code, 
including some iterations as indicated by the green arrow. This may result in updates to the package, such 
as new metadata, files, or versions of the manuscript (note that the audit trail of curation actions and updates 
should be linked with the FDO to document the process). This action will eventually lead to a final 
acceptance of the whole submitted package. 


• Enhance metadata.


Step 6: Publish all. Upon approval, the newly created package, which is preferably associated with a
final FDO including all information and references that are needed for reuse, is deposited into a trustworthy
repository.


• Push the package to a repository via API (capture actions);
• Publish manuscript on journal website; and
• Issue a bi-directional link between manuscript and research compendium.


Step 7: End CWFR project. After some checks the project may be closed.


• Apply long-term preservation policy (if any).
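

The role of the MD-DO across the seven steps above, capturing state and accumulating references, can be sketched as follows. The class MetadataDO, its fields, and the strict step ordering are assumptions made for illustration; the iterations within Step 5 and the possibility that Step 3 is merged with Step 1 are deliberately not modelled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

STEPS = (
    "start CWFR project",
    "request paper review",
    "ingest package",
    "request data and code review",
    "perform curation",
    "publish all",
    "end CWFR project",
)


@dataclass
class MetadataDO:
    project_pid: str                                  # placeholder identifier
    references: dict = field(default_factory=dict)    # e.g. {"package": <PID>}
    history: list = field(default_factory=list)       # (step, timestamp) pairs

    def complete(self, step: str, **new_references: str) -> None:
        """Record a finished step and any references it produced."""
        if len(self.history) >= len(STEPS):
            raise ValueError("project already closed")
        expected = STEPS[len(self.history)]
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.references.update(new_references)
        self.history.append((step, datetime.now(timezone.utc).isoformat()))
```

For example, completing "ingest package" with package="example-pid/pkg-1" adds the package reference to the MD-DO, while attempting "publish all" before curation raises an error.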


Recall that the concept of FDO always includes an associated PID (e.g., Handle or DOI) and some
metadata describing the nature of the FDO’s bitstream. In addition, to ensure compliance with data protection
laws and still be able to automate the reproducible research publication workflow to the greatest extent, 
we argue that it is indispensable to tag the FDO with additional information. For example, the Weblicht 
software [20] might be used to define an error standard. 


3.3 Extensions of the Generic RRPW 


As indicated above, the RRPW may vary by community. Here, we will mention four cases which 
nevertheless show the generality of the chosen approach. 


Case A: Access to data. The volume of the data may be so high, or access to the data may be otherwise 
restricted, that it is not recommended or practical to copy them. In this case the MD-DO would include a 
reference to the externally stored data, i.e., the package would include some metadata but not the data. 
The workflow would proceed as usual with two exceptions: (1) the data review would need access
to the remotely stored data and the ability to execute tests on them, and (2) the publishing step would need to
include a check of the quality of the repository, that is, determine whether it is a trustworthy one (i.e.,
CoreTrustSeal certified) that supports “FAIRness.” In case of a negative result, a warning needs to be sent to the
researcher and the repository team so that they can take remedial action per repository policies. Finally, this step
would result in publishing the FDO, which may then include references to external repositories.
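

The repository check in Case A can be sketched as a function that returns warnings rather than blocking publication, since the text states that the FDO is still published. The names ExternalDataReference, repository_warnings, and TRUSTED_CERTIFICATIONS are hypothetical, and how a repository's certification status would actually be looked up is outside the scope of this sketch.

```python
from dataclasses import dataclass

# Assumption: the set of certifications a journal accepts as "trustworthy".
TRUSTED_CERTIFICATIONS = frozenset({"CoreTrustSeal"})


@dataclass(frozen=True)
class ExternalDataReference:
    data_pid: str                               # PID of the remotely stored data
    repository: str                             # name of the hosting repository
    certifications: frozenset = frozenset()     # certifications it holds


def repository_warnings(ref: ExternalDataReference) -> list[str]:
    """Return warnings to send to the researcher and the repository team."""
    if ref.certifications & TRUSTED_CERTIFICATIONS:
        return []
    return [
        f"Repository {ref.repository!r} holding {ref.data_pid} has no "
        "recognized certification; remedial action may be needed."
    ]
```

The package itself then carries only the metadata and the data_pid reference, not a copy of the data.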


Case B: Multiple datasets. For example, in the social sciences, data on large-scale surveys such as a 
census or a nationwide representative study are generated by multiple research groups in independent 
projects. These datasets are assembled into collections, meaning that a data file and its metadata descriptions 
are replaced by a collection DO and additional files. However, this does not change the canonical workflow. 
The collection DO resolves into its components to track the individual components of the previous canonical 
workflow. 
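

How a collection DO resolves into its components can be illustrated with a short recursive function. The in-memory registry dictionary and the "members" key are assumptions standing in for a real PID resolution service, and the sketch does not guard against cyclic collections.

```python
def resolve(pid: str, registry: dict[str, dict]) -> list[str]:
    """Return the PIDs of all non-collection objects reachable from `pid`.

    `registry` maps each PID to its record; a record with a non-empty
    "members" list is treated as a collection.
    """
    members = registry[pid].get("members")
    if not members:
        return [pid]                 # a plain dataset resolves to itself
    leaves = []
    for member in members:
        leaves.extend(resolve(member, registry))   # nested collections
    return leaves
```

Each PID returned this way can then pass through the same canonical steps as a single dataset would, which is why the workflow itself is unchanged.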


Case C: Standard software. In many research areas, such as materials science, a few large and well- 
known software packages are being used by the community to generate simulation results data. Such 
software packages are under an extensive review process by their creators. The code review component 
would be simplified to a formal check whether the software package is mentioned in some list of community 
accepted software. In other research areas, computations are performed by individually created scripts or 
programs. These require a separate review process. We note that evaluating the reproducibility of a program 
code must account for the variability of some parameters, for example, the same source code of a machine 
learning program might lead to different sets of outputs and weights depending on the machine [21]. 
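

Both ideas in Case C, routing the code review by software and allowing for run-to-run variability, can be sketched as follows. The software name "ExampleSim", the function names, and the default tolerance are placeholders chosen for illustration; an appropriate tolerance depends on the method and would need to be set by the community or the journal.

```python
import math

# Placeholder for a community-maintained list of accepted software.
ACCEPTED_SOFTWARE = {"ExampleSim"}


def code_review_route(software: str) -> str:
    """Choose between a formal list check and a separate code review."""
    return "formal check" if software in ACCEPTED_SOFTWARE else "separate review"


def results_match(reported: list[float], reproduced: list[float],
                  rel_tol: float = 1e-6) -> bool:
    """Compare reported and reproduced values within a relative tolerance.

    Exact equality is too strict when the hardware or library version
    changes the last digits, so values are compared with math.isclose.
    """
    if len(reported) != len(reproduced):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(reported, reproduced)
    )
```

A tolerance-based comparison of this kind suits numeric summaries; it would not by itself settle whether two sets of machine learning weights are equivalent.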


Case D: In research with human subjects (e.g., clinical trials), data may more often be subject to 
restrictions due to ethical concerns and less so due to use constraints based on size. More complex or 
additional steps must be taken for ethical reasons or because of privacy laws, some of which may be difficult 
to automate (e.g., ethics review, curation). However, this does not fundamentally change the basic structure 
of the workflow. 


4. DISCUSSION 


In this paper we present the Reproducible Research Publication Workflow. A key aspect of the RRPW is 
the inclusion of artefact review and metadata creation as integral to the publication workflow. As noted by 
Velterop and Schultes [22], “the most important role of publishers and preprint platforms is to ensure that 
detailed, domain-specific, and machine-actionable metadata are provided with all publications ... [and] all
‘research objects’ that they publish.” The RRPW does not put the onus on publishers, or any one actor, 
alone. On the contrary, object and process metadata is accumulated and updated along the way in 
collaboration with authors and allied professionals, such as data curators and stewards and research 
software engineers, leading to metadata-rich and therefore more reusable published materials. Indeed, we 
encourage such professionals to bring their expertise to bear on the CWFR. 


Reproducible Research Publication Workflow: A Canonical Workflow Framework and FAIR
Digital Object Approach to Quality Research Output


In the context of quantitative research that uses statistical analysis methods, the RRPW can be considered 
a canonical workflow following the tradition of the CWFR. First, it includes recurring patterns with common 
practices “frequently resulting in fragmented and potentially irreproducible sequences that mix manual and 
machine-based steps” [23]. These actions are carried out by researchers, publishers, and curators, each of 
whom can have different implementations dependent on their role. Second, there is potential for integrating 
these actions into a CWFR by developing operations that follow import and export standards and that can 
be put into libraries and then be reused when needed. Third, the RRPW supports FAIR principles by, for 
example, ensuring the assignment of PIDs and the creation of rich comprehensive metadata including 
provenance information. 


This CWFR approach to publishing reproducible research has several advantages. First, the RRPW is a 
canonical, generic workflow that constitutes a standard that promotes data and code reuse through effective 
quality control and metadata. By elevating digital objects to a higher level of FAIR compliance, the RRPW 
contributes to research reproducibility and transparency. The RRPW facilitates accounting of all the digital 
objects, whether they are FAIR or not. 


Second, the RRPW has the potential to make a large number of digital objects visible and FAIR and 
represents a step towards the ability to handle a growing volume of digital objects using protocols and 
standards. Canonical components based on routine standards and protocols can enable more automated 
and scaled publication of reproducible research and contribute to the development of the Global 
Interoperable Data Space (with DOIP type of interactions) [12]. Moreover, the RRPW has the potential not 
only to accommodate multiple technologies but to inspire improved interoperability between those tools 
to better fit the standard workflow. 


Third, the workflow can be adapted at any point subject to the overarching goals of producing a digital 
object that is permanent and freely accessed and includes rich annotated and machine-readable metadata 
so it can be interpreted, understood and used by the Designated Community without having to resort to 
special resources not widely available. For example, the paper illustrates how the RRPW operates in the 
context of quantitative social science research, and we encourage other communities (e.g., qualitative 
methods, hermeneutics research) to consider relevant component libraries. Importantly, a core component 
of the RRPW is recognizing and codifying the complex relationships between scholarly findings and 
supporting artefacts, yet it is flexible enough to accommodate future representations of the findings (i.e., 
other than the traditional form of publication via PDF) and artefacts of various kinds. 


Fourth, researchers interested in using published research outputs will benefit from automated procedures 
for accessing these outputs, especially if they have small budgets. Researchers participating in the RRPW 
are encouraged to produce FAIR research output at the onset of a research project, thus reducing the burden 
of creating FAIR Digital Objects at the end of the research lifecycle. 
Finally, and crucially, the RRPW can help address real-world challenges by promoting interoperability
for heterogeneous data via transdisciplinary exchange of DOs. For example, the Virus Outbreak Data
Network (VODAN) is committed to making the SARS-CoV-2 virus data FAIR, to enable harnessing
“machine-learning and future AI approaches to discover meaningful patterns in epidemic outbreak” [24].
Other examples from environmental and life science are discussed in Harjes et al. [25]. 


We note that it might be challenging in some cases to carry out review of data and software due to their 
specific nature and the specialized knowledge required to do so. For studies with smaller budgets, for 
example, it might be beneficial to reduce the number of curation steps. However, we maintain that the 
proposed application of the CWFR will help achieve the goals of transparency and reproducibility, increased 
FAIR compliance of all research artefacts at all steps, and the exchange of annotated and machine-readable 
metadata. 